Yoga Studio and Gentrification
Gentrification is the process of urban transformation in which neighborhoods undergo socioeconomic changes, typically characterized by an influx of higher-income residents and investment. This often leads to increased property values and rents, which can displace lower-income residents and alter the cultural landscape of the area. While gentrification can improve infrastructure and amenities, it raises concerns about social equity, displacement, and the loss of community identity.
“Seemingly overnight, a yoga studio replaces a barbershop, a coffee café takes over a small grocery, and a multi-story apartment building looms where older single-family homes once stood.” (Katharine Miller, 2024))
In today’s lab session, you will be answering this question: Are yoga studios associated with gentrification?
Operational Definition of Gentrification
Gentrification is about change; the transformation of a neighborhood once affordable becoming unaffordable, so we have two conditions: “once affordable” and “becoming unaffordable”.
To create a simple operational definition, let’s say:
- Measure of affordability: real estate tax
- Time period: 2010 - 2019
- “Once affordable”: real estate tax in 2010 is below median value
- “Becoming unaffordable”: change of real estate tax between 2010 and 2019 is greater than the median value
Then we can get these three categories: Gentrifying, Not-gentrifying, and Not-gentrifiable
| Significant RE Tax Increase (2010-2019) | No Significant RE Tax Change/Decrease (2010-2019) | |
|---|---|---|
| Low RE Tax (2010) | Gentrifying | Not-gentrifying |
| High RE Tax (2010) | Gentrified | Gentrified |
Data
We need:
- Census ACS “median real estate tax” data for 2010 and 2019 in Fulton and Dekalb Counties
- Yoga studios within the two counties (will be provided)
Let’s first download the ACS 5-year estimate data for 2010 and 2019 for the two counties at the Tract level
tract_2010 <- suppressMessages(
get_acs(geography = ????,
state = ????,
county = c(????),
variables = c(med_re_tax = "B25103_001"),
year = 2010,
survey = ????,
geometry = TRUE,
output = ????)) %>%
select(GEOID, med_re_tax_2010 = med_re_taxE)
tract_2019 <- suppressMessages(
get_acs(geography = ????,
state = ????
county = c(????),
variables = c(med_re_tax = "B25103_001"),
year = 2019,
survey = ????,
geometry = FALSE, # why False?
output = ????)) %>%
select(GEOID, med_re_tax_2019 = med_re_taxE)
Join the two data and calculate the percentage change of median real estate tax between 2010 and 2019.
tract <- left_join(tract_2010,
tract_2019,
by = "GEOID") %>%
mutate(med_re_tax_change = (med_re_tax_2019 - med_re_tax_2010)/med_re_tax_2010) %>%
drop_na(med_re_tax_change)
Let’s classify Census Tracts by the three categories we defined.
print(paste0("Median value of median real estate tax in 2019: $",
median(tract$med_re_tax_2010) %>% round(2)))
## [1] "Median value of median real estate tax in 2019: $2125"
print(paste0("Median value of the percentage change of median real estate tax between 2010 and 2019: ",
median(tract$med_re_tax_change) %>% round(2)))
## [1] "Median value of the percentage change of median real estate tax between 2010 and 2019: 0.07"
tract <- tract %>%
mutate(gentri_category = case_when(
med_re_tax_2010 >= median(med_re_tax_2010) ~ "Gentrified",
med_re_tax_change >= median(med_re_tax_change) ~ "Gentrifying",
TRUE ~ "Not-gentrifying"))
print(table(tract$gentri_category))
##
## Gentrified Gentrifying Not-gentrifying
## 165 51 114
Let’s display what we have!
## tmap mode set to interactive viewing
tmap_mode("view")
tm_shape(tract) + tm_polygons(col = "med_re_tax_2019")
tm_shape(tract) + tm_polygons(col = "med_re_tax_change")
tm_shape(tract) + tm_polygons(col = "gentri_category")
Now download the Yelp data. This has been done for you.
To save time in class, we will use the Yelp data that is already downloaded for today’s class.
This data is for Fulton and DeKalb County, GA, and contains Yelp data with categories = “yoga”. This data is already cleaned. To read the data into R, we will use st_read() function in sf package.
# Reading the yelp data
yelp_yoga <- st_read("https://raw.githubusercontent.com/ujhwang/urban-analytics-2024/main/Lab/module_2/week1/yelp_yoga.geojson")
## Reading layer `yelp_yoga' from data source
## `https://raw.githubusercontent.com/ujhwang/urban-analytics-2024/main/Lab/module_2/week1/yelp_yoga.geojson'
## using driver `GeoJSON'
## Simple feature collection with 328 features and 24 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -84.73616 ymin: 33.52442 xmax: -84.09178 ymax: 34.12817
## Geodetic CRS: WGS 84
Let’s see if the results look fine
Get a summary using skimr::skim() and then display it
(i.e., the point locations of yoga studios) on the Census Tract map.
skim(yelp_yoga)
| Name | yelp_yoga |
| Number of rows | 328 |
| Number of columns | 25 |
| _______________________ | |
| Column type frequency: | |
| character | 19 |
| logical | 1 |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| id | 0 | 1.00 | 22 | 22 | 0 | 328 | 0 |
| alias | 0 | 1.00 | 11 | 66 | 0 | 328 | 0 |
| name | 0 | 1.00 | 3 | 50 | 0 | 296 | 0 |
| image_url | 0 | 1.00 | 0 | 68 | 44 | 285 | 0 |
| url | 0 | 1.00 | 168 | 223 | 0 | 328 | 0 |
| categories | 0 | 1.00 | 4 | 51 | 0 | 150 | 0 |
| transactions | 0 | 1.00 | 0 | 0 | 328 | 1 | 0 |
| phone | 0 | 1.00 | 0 | 12 | 16 | 298 | 0 |
| display_phone | 0 | 1.00 | 0 | 14 | 16 | 298 | 0 |
| price | 321 | 0.02 | 2 | 3 | 0 | 2 | 0 |
| location.address1 | 48 | 0.85 | 0 | 31 | 31 | 235 | 0 |
| location.address2 | 112 | 0.66 | 0 | 15 | 90 | 101 | 0 |
| location.address3 | 82 | 0.75 | 0 | 26 | 238 | 9 | 0 |
| location.city | 0 | 1.00 | 6 | 19 | 0 | 26 | 0 |
| location.zip_code | 0 | 1.00 | 0 | 5 | 3 | 62 | 0 |
| location.country | 0 | 1.00 | 2 | 2 | 0 | 1 | 0 |
| location.state | 0 | 1.00 | 2 | 2 | 0 | 1 | 0 |
| location.display_address | 0 | 1.00 | 11 | 76 | 0 | 289 | 0 |
| geometry | 0 | 1.00 | 19 | 38 | 0 | 316 | 0 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| is_closed | 0 | 1 | 0 | FAL: 328 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| review_count | 0 | 1 | 12.98 | 28.67 | 0.00 | 0.00 | 1.00 | 12.00 | 184.00 | ▇▁▁▁▁ |
| rating | 0 | 1 | 2.57 | 2.24 | 0.00 | 0.00 | 3.50 | 4.80 | 5.00 | ▇▁▁▂▇ |
| distance | 0 | 1 | 1975.61 | 3754.22 | 73.20 | 770.04 | 1276.82 | 1928.48 | 34920.47 | ▇▁▁▁▁ |
| coordinates.latitude | 0 | 1 | 33.85 | 0.12 | 33.52 | 33.76 | 33.81 | 33.92 | 34.13 | ▁▅▇▂▃ |
| coordinates.longitude | 0 | 1 | -84.35 | 0.08 | -84.74 | -84.39 | -84.36 | -84.31 | -84.09 | ▁▁▇▆▁ |
tm_shape(yelp_yoga) + tm_dots(col="red") + tm_shape(tract) + tm_borders()
Appending Census data
Joining census data in this case is somewhat different from what we’ve previously done. We will do the following:
- make sure the CRS of the two sf objects are same
- Calculate how many yoga studios are within each census tract (there will be many tracts with 0 studios)
- Join the two files with the count of Yoga studios
## Check to see that the CRS for both the sf files are the same
head(tract$geometry)
## Geometry set for 6 features
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -84.34921 ymin: 33.74743 xmax: -84.23807 ymax: 33.94607
## Geodetic CRS: NAD83
## First 5 geometries:
## MULTIPOLYGON (((-84.34918 33.75992, -84.34918 3...
## MULTIPOLYGON (((-84.34919 33.75242, -84.34918 3...
## MULTIPOLYGON (((-84.33002 33.88039, -84.33468 3...
## MULTIPOLYGON (((-84.33058 33.94208, -84.32877 3...
## MULTIPOLYGON (((-84.23807 33.90507, -84.2442 33...
head(yelp_yoga$geometry)
## Geometry set for 6 features
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -84.36553 ymin: 33.77094 xmax: -84.27338 ymax: 34.07155
## Geodetic CRS: WGS 84
## First 5 geometries:
## POINT (-84.27338 34.07155)
## POINT (-84.34933 33.77094)
## POINT (-84.35977 33.85101)
## POINT (-84.36167 33.85179)
## POINT (-84.36553 33.84621)
## It turns out they are NOT! So we need to transform their CRS into the same
tract <- tract %>% st_transform(crs=4326) # you can skip 'crs='
yelp_yoga <- yelp_yoga %>% st_transform(crs=4326)
# Join tract geometry with yoga studios
tract_yoga <- st_join(tract, yelp_yoga %>% mutate(count = 1))
# Calculate the count of yoga studios at the tract level
tract_yoga_count <- tract_yoga %>%
group_by(GEOID) %>%
summarise(count = sum(count, na.rm = T))
## Now we are ready to join back the counts of yoga studios to the Tract data
tract <- tract %>%
left_join(tract_yoga_count %>% st_drop_geometry(),
by = "GEOID")
Make sure that the final dataset we will be using for our analysis is as expected
Let’s skim and map the data
????(tract)
# Let's check to see if the points of yoga studios and their counts in the polygon data match
tm_shape(????) + tm_polygons(col=????) +
tm_shape(????) + tm_dots()
## Warning: Couldn't find skimmers for class: sfc_MULTIPOLYGON, sfc; No
## user-defined `sfl` provided. Falling back to `character`.
| Name | tract |
| Number of rows | 330 |
| Number of columns | 7 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| GEOID | 0 | 1 | 11 | 11 | 0 | 330 | 0 |
| gentri_category | 0 | 1 | 10 | 15 | 0 | 3 | 0 |
| geometry | 0 | 1 | 195 | 3246 | 0 | 330 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| med_re_tax_2010 | 0 | 1 | 2683.02 | 1799.44 | 263.00 | 1409.75 | 2125.00 | 3421.00 | 10001.0 | ▇▅▂▁▁ |
| med_re_tax_2019 | 0 | 1 | 2940.77 | 2190.89 | 199.00 | 1253.50 | 2326.00 | 4192.25 | 10001.0 | ▇▅▃▁▁ |
| med_re_tax_change | 0 | 1 | 0.05 | 0.36 | -0.94 | -0.19 | 0.07 | 0.23 | 2.5 | ▂▇▁▁▁ |
| count | 0 | 1 | 0.98 | 1.60 | 0.00 | 0.00 | 0.00 | 1.00 | 10.0 | ▇▁▁▁▁ |
Now we are ready to ask some probing questions about the data
Are yoga studios more likely be in neighborhoods with high median real estate tax (2019)?
ggplot(tract, aes(x=????, y=count)) +
geom_point() +
ylab("Number of Yoga Studios") +
xlab("Median Real Estate Tax in Tract")
Not much of an association…
Are yoga studios more likely be in neighborhoods with increasing median real estate tax?
ggplot(tract, aes(x=????, y=count)) +
geom_point() +
ylab("Number of Yoga Studios") +
xlab("% Change of Real Estate Tax (2010-2019) in Tract")
Hmm, it’s showing some trend?
Are yoga studios in gentrifying neighborhoods?
tm_shape(tract) + tm_polygons(col=????) +
tm_shape(yelp_yoga) +tm_dots()
Most of yoga studios are in neighborhoods that are already gentrified.
A good way to quantify what we intuitively grasped from the previous map is to create a “boxplot”
boxplot(count ~ gentri_category, data=tract,
main="Boxplot of Yoga Studios by Neighborhood Type",
xlab="Neighborhood types",
ylab="Yoga studio count")
Gentrifying neighborhoods have more yoga studios than not-gentrifying neighborhoods, but less than already gentrified neighborhoods.
Let’s convert it into a binary case: whether a Tract has a yoga studio or not.
tract <- tract %>%
mutate(yoga_binary = count > 0)
yoga_binary_summary <- tract %>%
st_drop_geometry() %>%
group_by(yoga_binary, gentri_category) %>%
summarize(n = n()) %>%
ungroup() %>%
mutate(proportion = n / sum(n))
## `summarise()` has grouped output by 'yoga_binary'. You can override using the
## `.groups` argument.
yoga_binary_summary %>%
ggplot(aes(x = gentri_category, y = proportion, fill = yoga_binary)) +
geom_bar(stat = 'identity', position = 'fill')
A similar story to the previous box plot.
Finally, let’s run a regression model.
- Dependent variable: count of yoga studios
- Independent variable: type of neighborhood in terms of gentrification
- Control variables:
- Educational attainment (Rationale: higher education levels may correlate with interest in yoga)
- Median household income (Rationale: income levels can influence the presence of amenities like yoga studios)
- Median age (Rationale: Age demographics could influence the demand for yoga studios)
- Population density (Rationale: denser areas might have more yoga studios due to higher foot traffic)
- Racial/ethnic composition (Rationale: demographic makeup might correlate with yoga studio presence)
- % of population who commute by walking (Rationale: it indicates pedestrian-friendly areas which may be more likely to have yoga studios)
Let’s get these control variables.
tract_control_vars <- suppressMessages(
get_acs(geography = ????,
state = ????,
county = c(????),
variables = c(edu_bachelor = "B06009_005",
edu_graduate = "B06009_006",
hhincome = "B19013_001",
median_age = "B01002_001",
pop = "B01003_001",
non_hispanic_white = "B03002_003",
commute_walk = "B08101_033",
commute_total = "B08101_001"),
year = 2019,
survey = ????,
geometry = TRUE,
output = ????)) %>%
mutate(high_edu_pct = (edu_bachelorE + edu_graduateE)/popE,
hhincome = hhincomeE,
median_age = median_ageE,
minority_pct = (popE - non_hispanic_whiteE)/popE,
commute_walk_pct = commute_walkE/commute_totalE,
pop = popE)
# Calculate land area of Census Tract and get population density using the area
tract_control_vars <- tract_control_vars %>%
mutate(area = st_area(.) %>% set_units(mi^2)) %>%
mutate(pop_density = pop/area)
# Remove unnecessary columns including geometry
tract_control_vars <- tract_control_vars %>%
select(GEOID, high_edu_pct, hhincome, median_age, minority_pct, commute_walk_pct, pop_density) %>%
st_drop_geometry()
# Join the control variables to the original Tract data
tract <- tract %>%
left_join(tract_control_vars, by = ????)
## | | | 0% | |= | 1% | |== | 3% | |==== | 5% | |====== | 8% | |======= | 10% | |======== | 12% | |========= | 14% | |=========== | 15% | |============ | 18% | |============== | 21% | |=============== | 21% | |====================== | 31% | |============================ | 40% | |============================= | 41% | |============================== | 43% | |=============================== | 44% | |=============================== | 45% | |================================= | 47% | |=================================== | 50% | |==================================== | 51% | |==================================== | 52% | |===================================== | 53% | |====================================== | 55% | |======================================== | 56% | |======================================== | 57% | |========================================= | 59% | |=========================================== | 61% | |=========================================== | 62% | |============================================= | 64% | |=============================================== | 67% | |================================================= | 70% | |=================================================== | 73% | |===================================================== | 75% | |======================================================= | 78% | |======================================================== | 80% | |========================================================== | 83% | |=========================================================== | 84% | |============================================================= | 87% | |============================================================== | 89% | |================================================================ | 91% | |================================================================= | 93% | |================================================================== | 94% | |=================================================================== | 96% | |==================================================================== | 98% | |===================================================================== | 99% | |======================================================================| 100%
Convert the gentri_category into a factor column where
“Gentrifying” comes first in the category. (I will explain why)
tract <- tract %>%
mutate(gentri_category = factor(gentri_category,
levels = c("Gentrifying",
"Not-gentrifying",
"Gentrified")))
Run multiple linear regression.
m1 <- lm(count ~ gentri_category + high_edu_pct + hhincome +
median_age + minority_pct + commute_walk_pct + pop_density,
data=tract)
summary(m1)
##
## Call:
## lm(formula = count ~ gentri_category + high_edu_pct + hhincome +
## median_age + minority_pct + commute_walk_pct + pop_density,
## data = tract)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4121 -0.7967 -0.2649 0.3558 7.8332
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.959e+00 9.824e-01 1.994 0.04700 *
## gentri_categoryNot-gentrifying -2.442e-01 2.525e-01 -0.967 0.33429
## gentri_categoryGentrified 5.667e-01 3.036e-01 1.866 0.06291 .
## high_edu_pct 1.748e+00 1.070e+00 1.634 0.10331
## hhincome -1.566e-05 3.576e-06 -4.379 1.61e-05 ***
## median_age 1.269e-02 1.684e-02 0.753 0.45179
## minority_pct -1.420e+00 7.132e-01 -1.991 0.04738 *
## commute_walk_pct 7.689e+00 2.179e+00 3.528 0.00048 ***
## pop_density -8.441e-05 3.158e-05 -2.673 0.00789 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.431 on 321 degrees of freedom
## Multiple R-squared: 0.2227, Adjusted R-squared: 0.2033
## F-statistic: 11.5 on 8 and 321 DF, p-value: 2.324e-14
How would you interpret this result?